This course covers the basics of reproducible research with R, RStudio, and GitHub. My GitHub repository can be found here: https://github.com/viljahe/IODS-project
date()
## [1] "Thu Nov 26 14:12:05 2020"
I found this course by browsing courses for doctoral students, trying to find something that would actually benefit me in some way: I’m interested in open science, and while I have been using R for some years, Git is something I’m not yet familiar with but have been wanting to learn for a while now. So this was a good opportunity to do just that.
I’m looking forward to learning how Git can make my research more open and more reproducible. I hope I can adapt things I learn from this course as a part of my workflow. Three things I’m most looking forward to/goals for myself for this course:
This week we learned about data wrangling and linear regression. First, we created a data set for the analyses below, including some summary variables. Then we learned how to fit a multiple linear regression model and how to assess its diagnostics.
# load the necessary packages and set scientific notation off
options(scipen = 999)
library("GGally")
library("ggplot2")
First, let’s load the data set created for this exercise into R and inspect its dimensions and structure using dim() and str():
learning2014 <- read.table("learning2014.txt")
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
The data contain information on 166 students of an introductory statistics course. The data include participants’ gender and age, a measure of global attitudes towards statistics, participants’ exam points, and three variables reflecting deep, surface, and strategic learning. Each of the three learning variables is the mean of several original items asking about the participant’s learning approach in that category, and the attitude variable is the mean of ten original items measuring different attitudes towards statistics.
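The wrangling script itself is not shown here, but the creation of the summary variables can be sketched roughly as follows. Note that the item names below are placeholders; the real ones come from the original questionnaire.

```r
# Hypothetical sketch of the wrangling step: each summary variable is
# the mean of its corresponding questionnaire items (names are placeholders)
deep_items <- c("D03", "D11", "D19", "D27")     # placeholder item names
surf_items <- c("SU02", "SU10", "SU18", "SU26") # placeholder item names
lrn14$deep <- rowMeans(lrn14[, deep_items])
lrn14$surf <- rowMeans(lrn14[, surf_items])
```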
We can inspect the data and the associations between variables visually using the function ggpairs() and look at the most common descriptive statistics with the help of summary() to get a more complete grasp of the data.
p <- ggpairs(learning2014,
mapping = aes(col = gender, alpha = 0.1),
lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor", size = 2.5))) +
theme_bw() +
theme(strip.background = element_rect(fill = "white"),
axis.text.x = element_text(size=5),
axis.text.y = element_text(size=5))
p
summary(learning2014)
## gender age attitude points
## Length:166 Min. :17.00 Min. :1.400 Min. : 7.00
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:19.00
## Mode :character Median :22.00 Median :3.200 Median :23.00
## Mean :25.51 Mean :3.143 Mean :22.72
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:27.75
## Max. :55.00 Max. :5.000 Max. :33.00
## deep surf stra
## Min. :1.583 Min. :1.583 Min. :1.250
## 1st Qu.:3.333 1st Qu.:2.417 1st Qu.:2.625
## Median :3.667 Median :2.833 Median :3.188
## Mean :3.680 Mean :2.787 Mean :3.121
## 3rd Qu.:4.083 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.917 Max. :4.333 Max. :5.000
The data contain more women than men, and the participants are mostly under 30 years of age. There are some differences between men and women in how the variables are distributed, most visibly in the global attitudes towards statistics, where men score somewhat higher on average and the women’s scores are spread more evenly. Comparing the learning approaches, deep learning scores skew a bit towards the higher end, whereas strategic learning scores are spread more evenly around the middle. Men’s surface learning scores are distributed more evenly than women’s, which show a more visible peak around the middle of the scale.
There is a significant positive correlation between global attitudes towards statistics and the exam score (\(r = .437\)), and this holds in both gender groups. Attitudes towards statistics also seem to be slightly correlated with surface learning, with a more positive attitude associated with a lower surface learning score (\(r = -.176\)); upon closer inspection, however, there is no significant correlation for women, and for men the association is stronger than for all participants together (\(r = -.374\)). There is also an expected correlation between surface and deep learning, with higher surface learning scores associated with lower deep learning scores (\(r = -.324\)), but surprisingly this effect is again mostly driven by a strong association for men (\(r = -.622\)), whereas for women the two learning approaches are not associated.
The associations between variables can be further examined using linear regression. For example, we can see how exam score is explained by the variables that have the strongest correlation with it: attitudes towards statistics, the surface learning approach, and the strategic learning approach.
model1 <- lm(points ~ attitude + surf + stra, data = learning2014)
summary(model1)
##
## Call:
## lm(formula = points ~ attitude + surf + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 0.0000000193 ***
## surf -0.5861 0.8014 -0.731 0.46563
## stra 0.8531 0.5416 1.575 0.11716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 0.00000003156
Based on the results of the regression, more positive attitudes towards statistics are associated with higher exam scores (\(B = 3.40, p < .001\)): for each one-point increase on the attitude scale, the predicted exam score increases by approximately 3.4 points. The two learning approaches included in the model are not significantly associated with exam scores and can therefore be removed from the model.
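As a quick sanity check of this interpretation, we can compare the model’s predictions for two attitude values one point apart, holding the other predictors at their means; the two predictions should differ by exactly the attitude coefficient:

```r
# Predicted exam points for attitude = 2 vs. 3, other predictors at their means
new_data <- data.frame(attitude = c(2, 3),
                       surf = mean(learning2014$surf),
                       stra = mean(learning2014$stra))
predict(model1, newdata = new_data)  # the predictions differ by ~3.4 points
```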
# remove surf first as its p-value is the higher of the two
model2 <- update(model1, ~. -surf)
summary(model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 0.00000000631 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 0.000000007734
# stra is still not significant, remove it as well
model3 <- update(model2, ~. -stra)
summary(model3)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 0.00000000195 ***
## attitude 3.5255 0.5674 6.214 0.00000000412 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 0.000000004119
With the surface learning approach removed from the model, the p-value associated with the strategic learning approach decreases, but the effect remains non-significant at the conventional \(p < .05\) level. Removing both non-significant predictors does not greatly affect the association between attitudes towards statistics and exam scores: more positive attitudes remain associated with higher exam scores, and the magnitude of the effect stays roughly the same (\(B = 3.53, p < .001\)). The multiple \(R^2\) is a measure of the goodness of fit of our model, i.e. it tells how much of the variance of the dependent variable the model explains. From the \(R^2\) we can infer that the model explains approximately 19% of the variance in exam scores.
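The reported \(R^2\) can be verified by hand from the residuals of the final model:

```r
# R^2 = 1 - SS_res / SS_tot, computed from the fitted model
ss_res <- sum(residuals(model3)^2)
ss_tot <- sum((learning2014$points - mean(learning2014$points))^2)
1 - ss_res / ss_tot  # matches the Multiple R-squared above, ~0.19
```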
Let’s then examine how well our model meets the assumptions of linear regression. We can do this by inspecting diagnostic plots:
par(mfrow=c(2,2))
plot(model3, c(1, 2, 5))
In the first figure, the residuals are plotted against the fitted values of the model. The assumption of constant error variance is met, as the residuals are spread quite randomly. From the QQ plot we can see that the errors are reasonably normally distributed, following the dashed line quite well. The third figure shows that no observations have very high leverage, so no single data point exerts undue influence on the model. Based on these, we can conclude that there are no significant problems with the model.
This week was all about data wrangling (joining two datasets together) and logistic regression. In the data wrangling part we again created a data set for the analyses below by merging two datasets and creating some new variables. Then we learned how to fit a logistic regression model and assess its predictive power.
# load the necessary packages and set scientific notation off
options(scipen = 999)
library("dplyr")
library("tidyr")
library("ggplot2")
library("gridExtra")
library("Hmisc")
The data consist of information about students’ alcohol consumption, school performance in two subjects (mathematics and Portuguese language), and some background variables relating to their social environment, family, and school. The data were collected from two Portuguese schools. More information about the data can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance
alc <- read.csv("alc.csv", sep=",", header=TRUE)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "failures.m"
## [16] "schoolsup" "famsup" "paid.m" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences.m"
## [31] "G1.m" "G2.m" "G3.m" "failures.p" "paid.p"
## [36] "absences.p" "G1.p" "G2.p" "G3.p" "id.p"
## [41] "failures" "paid" "absences" "G1" "G2"
## [46] "G3" "alc_use" "high_use"
The data have been merged from two separate datasets, one for math and the other for Portuguese language. The background variables are identical across the two original datasets, but the variables relating to school performance (failures, paid, absences, G1, G2, and G3) differ between the two subjects. These variables are the number of past class failures (failures), whether or not the student has taken extra paid classes (paid), the number of absences (absences), and the first, second, and final grades (G1-G3, respectively). The final dataset used here therefore has three variables for each subject-specific original variable: the two originals, denoted by the suffix .p (Portuguese) or .m (math), and their mean (without a suffix).
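The merge itself happened in the data wrangling script, but it can be roughly sketched as an inner join on the shared background variables. This is only a sketch: it assumes data frames math and por read from the original UCI files, and the exact join columns used in the wrangling may differ.

```r
# Rough sketch of the merge, assuming data frames math and por
library(dplyr)
join_cols <- c("school", "sex", "age", "address", "famsize", "Pstatus",
               "Medu", "Fedu", "Mjob", "Fjob", "reason", "guardian")
alc <- inner_join(math, por, by = join_cols, suffix = c(".m", ".p"))
```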
I chose to examine how weekly study time (studytime), parents’ cohabitation status (Pstatus), the number of school absences (absences), and the mean of the final grades in math and Portuguese (G3) affect whether a student’s alcohol consumption is high or low (high_use). Study time is a discrete variable with four categories: less than 2 hours (1), 2-5 hours (2), 5-10 hours (3), and more than 10 hours (4). Parents’ cohabitation status has two possible values: living together (T) and living apart (A). The number of school absences is a continuous variable ranging from 0 to 93, and alcohol consumption is a categorical variable based on the mean of workday and weekend alcohol consumption (each rated from 1, “very low”, to 5, “very high”): those with a mean above 2 have been categorized as high consumers.
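Based on this description, the alcohol-use variables were derived along these lines (a sketch of the wrangling step, using the workday and weekend rating variables Dalc and Walc from the original data):

```r
# Sketch of deriving the alcohol-use variables from the workday (Dalc)
# and weekend (Walc) consumption ratings
library(dplyr)
alc <- alc %>%
  mutate(alc_use = (Dalc + Walc) / 2,  # mean of workday and weekend use
         high_use = alc_use > 2)       # TRUE when the mean exceeds 2
```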
My hypotheses are that more time spent on studying and higher grades are connected to low alcohol consumption, while higher number of absences and parents living apart are connected to high alcohol consumption.
Let’s have a closer look at the chosen variables and how they relate to each other.
# subset for only relevant variables
alc.sub <- subset(alc, select = c(studytime, absences, G3, Pstatus, high_use))
# see numerical summaries of variables
summary(alc.sub[(-4)])
## studytime absences G3 high_use
## Min. :1.000 Min. : 0.000 Min. : 0.00 Mode :logical
## 1st Qu.:1.000 1st Qu.: 1.000 1st Qu.:10.00 FALSE:259
## Median :2.000 Median : 3.000 Median :12.00 TRUE :111
## Mean :2.043 Mean : 4.511 Mean :11.52
## 3rd Qu.:2.000 3rd Qu.: 6.000 3rd Qu.:14.00
## Max. :4.000 Max. :45.000 Max. :18.00
table(alc.sub["Pstatus"])
##
## A T
## 38 332
# barplots for the distributions
numplot <- alc.sub[c("G3", "studytime", "absences")] %>%
gather() %>%
ggplot(aes(as.numeric(value))) +
facet_wrap("key", scales = "free") +
geom_bar() +
xlab("value") +
theme_bw() +
theme(strip.background = element_rect(fill = "white"))
discplot <- alc.sub[c("Pstatus", "high_use")] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap("key", scales = "free") +
geom_bar() +
theme_bw() +
theme(strip.background = element_rect(fill = "white"))
grid.arrange(numplot, discplot, nrow=2)
We can see that most students have quite a low number of absences, with a few outliers above 40. The mean of the final grades is 11.5, and the distribution is somewhat skewed towards higher grades. Study time is mostly below 5 hours per week. There are more than twice as many students whose alcohol consumption is categorized as low (259) as high (111), and most students’ parents live together.
# box plots for alcohol use vs. continuous variables
box1 <- ggplot(alc.sub, aes(x = high_use, y = G3)) +
geom_boxplot() +
ylab("grade") +
ggtitle("Alcohol use and final grades") +
theme_bw()
box2 <- ggplot(alc.sub, aes(x = high_use, y = absences)) +
geom_boxplot() +
ylab("absences") +
ggtitle("Alcohol use and absences") +
theme_bw()
# bar plots for alcohol use vs. discrete variables
bar1 <- alc.sub %>% ggplot(aes(x = studytime, fill = high_use)) +
geom_bar(position = "dodge") +
xlab("Study time") +
ggtitle("Alcohol use and study time") +
theme_bw()
bar2 <- alc.sub %>% ggplot(aes(x = Pstatus, fill = high_use)) + geom_bar(position = "dodge") +
xlab("Parental status") +
ggtitle("Alcohol use and parental status") +
theme_bw()
# draw the plots
grid.arrange(box1, box2, bar1, bar2, ncol=2)
At first glance, alcohol use does not seem connected to the mean of the final grades or the number of absences. Weekly study time seems to be a bit lower among high alcohol users, and students whose parents live apart might be overrepresented among high users as well, although since very few parents in these data live apart, it is hard to draw definite conclusions from these figures alone.
# correlations for numerical variables
corrs <- rcorr(as.matrix(alc.sub[(-4)]))
corrs
## studytime absences G3 high_use
## studytime 1.00 -0.12 0.17 -0.21
## absences -0.12 1.00 -0.10 0.22
## G3 0.17 -0.10 1.00 -0.13
## high_use -0.21 0.22 -0.13 1.00
##
## n= 370
##
##
## P
## studytime absences G3 high_use
## studytime 0.0254 0.0007 0.0000
## absences 0.0254 0.0484 0.0000
## G3 0.0007 0.0484 0.0112
## high_use 0.0000 0.0000 0.0112
# crosstabs for parental status and high use
table(alc.sub[c("Pstatus", "high_use")])
## high_use
## Pstatus FALSE TRUE
## A 26 12
## T 233 99
The rcorr() function prints the correlations between the variables and the associated p-values. Both study time and absences are statistically significantly correlated with alcohol use: more time spent on studies is associated with low alcohol consumption (\(r = -.21, p<.001\)) and a higher number of absences with high alcohol consumption (\(r = .22, p<.001\)), which supports my hypotheses. Higher grades are also associated with more time spent studying (\(r = .17, p<.001\)) but not statistically significantly with alcohol use, although the direction of the association between the mean of final grades and alcohol use is in line with the hypothesis.
The proportion of high alcohol users in both parental residential status groups is approximately the same, which does not support my hypothesis.
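The proportions behind this observation can be computed directly from the crosstab:

```r
# Row-wise proportions of high use within each parental status group
prop.table(table(alc.sub$Pstatus, alc.sub$high_use), margin = 1)
# both groups have roughly 30% high users (A: ~32%, T: ~30%)
```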
We can then use logistic regression to see if the results support my hypotheses.
# first, specify studytime as factor as it is actually a categorical variable
alc.sub$studytime <- as.factor(alc.sub$studytime)
# specify model
model1 <- glm(high_use ~ studytime + absences + G3 + Pstatus, data = alc.sub, family = "binomial")
# print summary
modelsum1 <- summary(model1)
modelsum1
##
## Call:
## glm(formula = high_use ~ studytime + absences + G3 + Pstatus,
## family = "binomial", data = alc.sub)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1429 -0.8596 -0.6513 1.1840 2.1990
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.22075 0.62094 -0.356 0.722211
## studytime2 -0.46568 0.26686 -1.745 0.080987 .
## studytime3 -1.36077 0.44121 -3.084 0.002041 **
## studytime4 -1.28496 0.58511 -2.196 0.028086 *
## absences 0.07614 0.02311 3.294 0.000986 ***
## G3 -0.05500 0.03682 -1.494 0.135216
## PstatusT 0.13698 0.40095 0.342 0.732632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 416.24 on 363 degrees of freedom
## AIC: 430.24
##
## Number of Fisher Scoring iterations: 4
# retrieve odds ratios, confidence intervals and p-values
OR <- coef(model1) %>% exp()
CI <- confint(model1) %>% exp()
p <- coef(modelsum1)[,4]
# print table of odds ratios, confidence intervals and p-values
cbind(OR, CI, p) %>% round(3)
## OR 2.5 % 97.5 % p
## (Intercept) 0.802 0.233 2.684 0.722
## studytime2 0.628 0.372 1.061 0.081
## studytime3 0.256 0.102 0.585 0.002
## studytime4 0.277 0.076 0.797 0.028
## absences 1.079 1.034 1.132 0.001
## G3 0.946 0.880 1.017 0.135
## PstatusT 1.147 0.535 2.608 0.733
In the specified model, weekly study time and absences are both statistically significantly associated with alcohol use. Those who study 5-10 hours (\(OR = 0.26\), 95% CI \([0.10, 0.59]\), \(p = .002\)) or over 10 hours per week (\(OR = 0.28\), 95% CI \([0.08, 0.80]\), \(p = .028\)) are less likely to be high alcohol users than those in the reference category who study less than 2 hours per week, while those with more absences are more likely to be high alcohol users (\(OR = 1.08\), 95% CI \([1.04, 1.13]\), \(p < .001\)). The mean of final grades and parental status are not associated with alcohol use.
The hypotheses regarding study time and absences were supported, but the results do not support the hypotheses about final grades and parental residential status being associated with alcohol use.
Only two of the hypothesized variables were connected to alcohol use. Let’s see how well a model which includes these two variables predicts alcohol use.
model2 <- update(model1,.~. -G3 -Pstatus)
summary(model2)
##
## Call:
## glm(formula = high_use ~ studytime + absences, family = "binomial",
## data = alc.sub)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1907 -0.8577 -0.7137 1.1984 2.1174
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.69160 0.23897 -2.894 0.003802 **
## studytime2 -0.50842 0.26455 -1.922 0.054631 .
## studytime3 -1.43763 0.43660 -3.293 0.000992 ***
## studytime4 -1.36272 0.58234 -2.340 0.019279 *
## absences 0.07788 0.02305 3.379 0.000727 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 418.70 on 365 degrees of freedom
## AIC: 428.7
##
## Number of Fisher Scoring iterations: 4
# predict() the probability of high_use
# (note: the predictions below come from the full model1, not the reduced model2)
probabilities <- predict(model1, type = "response")
# add the predicted probabilities to 'alc'
alc.sub <- mutate(alc.sub, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc.sub <- mutate(alc.sub, prediction = probability > 0.5)
# 2x2 cross table of alcohol use vs. the predictions
table(high_use = alc.sub$high_use, prediction = alc.sub$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 247 12
## TRUE 93 18
From the cross-tabulation we can see that the majority fall into the category where both predicted and actual alcohol use was low. Most of the actual high alcohol users were predicted to be low users. Altogether, 93 persons were wrongly classified as low alcohol users, and 12 persons were wrongly classified as high alcohol users. This is not very surprising, given that the category sizes in our data are very unequal, with most students being low alcohol users.
We can also examine the predictions and actual alcohol use visually by drawing a scatterplot, and calculate the percentage of wrongly classified individuals:
# explore the predictions graphically
g.pred <- ggplot(alc.sub, aes(x = probability, y = high_use, col = prediction))
g.pred + geom_point(alpha = 0.5) + geom_jitter(height=0.1) + theme_bw()
# calculate and print the total proportion of wrong classifications
prop_wrong <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
perc <- sum(n_wrong)/length(class) * 100
print(paste(round(perc,2), "% classified wrong"))
}
prop_wrong(class = alc.sub$high_use, prob = alc.sub$probability)
## [1] "28.38 % classified wrong"
The percentage of wrongly classified individuals based on our model is 28.38%. Although this number is quite high, it is notably better than random guessing, which would yield approximately 50% wrong classifications.
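A stricter benchmark than random guessing is the naive classifier that always predicts low use; its error rate equals the share of high users in the data:

```r
# Error of the naive strategy of always predicting low alcohol use:
# the mean of a logical vector is the proportion of TRUE values
mean(alc.sub$high_use) * 100  # ~30% of students are high users
```

The model’s 28.38% error is therefore only a modest improvement over this naive baseline, reflecting the class imbalance noted above.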
This week we learned about clustering and classification methods.
# load the necessary packages and set scientific notation off
options(scipen = 999)
library("MASS")
library("ggplot2")
library("dplyr")
library("GGally")
In this week’s analyses we use the Boston dataset from the MASS package. The data describe housing values in the suburbs of Boston in the 1970s. More information can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
# load the data
data("Boston")
# explore the structure and dimensions
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston data consist of 506 observations of 14 variables, which include each town’s per capita crime rate, several variables reflecting the average socioeconomic status of the area and the demographics of its residents, general information about the location (e.g. proximity to highways and the Charles River), and air quality.
Let’s have a closer look at the data and the variables.
plot1 <- ggpairs(Boston,
mapping = aes(alpha = 0.1),
lower = list(continuous = wrap("points", size = 0.3)),
upper = list(continuous = wrap("cor", size = 2.5))) +
theme_bw() +
theme(strip.background = element_rect(fill = "white"),
axis.text.x = element_text(size=4),
axis.text.y = element_text(size=5))
plot1
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Some of the variables are heavily skewed towards lower values, most notably crim (the per capita crime rate by town), zn (proportion of residential land zoned for large lots), and chas (a dummy variable indicating whether the area borders the Charles River). The data therefore consist of areas that mostly have low crime rates and smaller residential lots and are not next to the river. The variables dis (distance to employment centres) and lstat (percentage of lower-status population) are also skewed towards lower values, so most areas are close to the employment centres and have a low percentage of lower-status residents. On the other hand, age, ptratio, and black are skewed towards higher values, meaning the data include more areas with older houses, high pupil-teacher ratios, and fewer black residents. A few variables have clearly bimodal distributions: indus (proportion of non-retail business acres), rad (accessibility of radial highways), and tax (property tax rate). The variable nox, the nitrogen oxides concentration, also has mostly low values, while medv, the median value of owner-occupied homes, peaks around the middle values.
There are a number of substantial correlations between the variables, and the scatterplots reveal some interesting patterns. For example, the nitrogen oxides concentration is higher where the houses are older and lower where the distance to the employment centres is greater, and a higher percentage of lower-status population is associated with a lower median value of owner-occupied homes. Nothing too surprising.
Next, we can standardize the data using the scale() function.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Scaling takes each value, subtracts the variable mean, and divides the difference by the standard deviation of the variable. Each scaled value thus expresses how many standard deviations the observation lies from the variable mean: the means of the scaled variables are now 0, values originally below the mean are negative, and values above the mean are positive. For example, the maximum crime rate in the original data is 88.98, which is 85.36 above the mean; the standard deviation of the crime rate is 8.60, so the maximum is 85.36 / 8.60 ≈ 9.92 standard deviations above the mean. This is the new maximum value in the scaled data.
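We can verify this calculation by hand:

```r
# The scaled maximum of crim equals its z-score in the original data
(max(Boston$crim) - mean(Boston$crim)) / sd(Boston$crim)  # ~9.92
```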
Let’s then replace the crime rate variable with a categorical variable of the crime rate using quantiles as the break points.
# change the class to dataframe
boston_scaled <- as.data.frame(boston_scaled)
# define the breaks
bins <- quantile(boston_scaled$crim)
# create the categorical variable
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove the original crime rate variable
boston_scaled <- dplyr::select(boston_scaled, -crim)
First, we will divide the dataset into train and test sets: we will use the train set to train our classification model and then see how well the model predicts the classes in the test set.
# choose randomly 80% of the rows in the data for the train set
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
Next, we will fit the linear discriminant analysis on the train set using the categorical crime rate variable crime as the target variable, and all the other variables as predictors.
# Fit the model
lda.crime <- lda(crime ~., data = train)
lda.crime
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2524752 0.2400990 0.2574257
##
## Group means:
## zn indus chas nox rm age
## low 0.98289371 -0.9048904 -0.077423115 -0.8875355 0.44606813 -0.8480603
## med_low -0.05489573 -0.3067947 -0.002135914 -0.6118687 -0.14512558 -0.4031577
## med_high -0.37938430 0.1687756 0.214734879 0.4008474 0.03253615 0.4187731
## high -0.48724019 1.0170690 -0.007331936 1.0733064 -0.39638855 0.7963535
## dis rad tax ptratio black lstat
## low 0.8692695 -0.6896375 -0.7531292 -0.5076793 0.38074461 -0.75932271
## med_low 0.4143469 -0.5517590 -0.5091792 -0.1384105 0.31451083 -0.12411300
## med_high -0.3658826 -0.4253975 -0.3115961 -0.3008897 0.07578919 0.04279428
## high -0.8537986 1.6386213 1.5144083 0.7813507 -0.85054697 0.86151464
## medv
## low 0.54492676
## med_low 0.01860534
## med_high 0.11144004
## high -0.69494268
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09147076 0.69859943 -1.0425885
## indus 0.03828461 -0.08090308 0.4437025
## chas -0.07753352 -0.05415673 0.1023825
## nox 0.37613747 -0.84622651 -1.3336012
## rm -0.12901120 -0.05574213 -0.1490547
## age 0.19891883 -0.30826376 -0.2482416
## dis -0.04159272 -0.22088373 0.2985786
## rad 3.29600927 1.12743424 0.0381204
## tax 0.12024734 -0.20148469 0.4709778
## ptratio 0.11564493 -0.01173114 -0.3306234
## black -0.12303903 0.05421779 0.1376763
## lstat 0.16288020 -0.11892018 0.4530085
## medv 0.18521186 -0.29111204 -0.1273527
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9527 0.0336 0.0137
# Draw the LDA biplot
# target classes as numeric
classes <- as.numeric(train$crime)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "black", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# define the colors for the plot from green (low) to red (high)
cols <- c("green", "blue", "purple", "red")
cols1 <- cols[classes]
# plot the lda results
plot(lda.crime, dimen= 2, col = cols1, pch = classes)
lda.arrows(lda.crime, myscale = 2)
We can see from the plot that the high crime rate areas cluster together, clearly separated from the other categories. The other categories also form clusters, although with much more overlap between them. The arrows show that the variable rad (accessibility to radial highways) contributes most to the separation of the groups: observations in the high crime rate group have clearly higher values on this variable compared to the other groups. The variables zn and nox seem to contribute more to the separation between the low, medium low, and medium high groups, with higher nitrogen oxides levels being related to higher crime rates, and a larger proportion of large residential plots being related to lower crime rates. The same can be seen from the table of group means.
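As a quick numeric check of this interpretation, we can compare the group means of rad directly (a small sketch assuming the `train` data frame created above; the values are in standardized units, so the exact numbers depend on the random split):

```r
# mean of the (standardized) rad variable in each crime category of the train set
aggregate(rad ~ crime, data = train, FUN = mean)
```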
# save crime rate categories from test data
correct_classes <- test$crime
# remove original crime rate categories
test <- dplyr::select(test, -crime)
# predict classes with the model on the test data
lda.pred <- predict(lda.crime, newdata = test)
# crosstabulate results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 11 0 0
## med_low 2 12 10 0
## med_high 0 7 20 2
## high 0 0 0 23
It looks like the predictions are pretty good: in all categories, the correct predictions are in the majority. The “high” group was predicted most accurately, which can also be seen from the previous plot, where the high values formed the most clearly separate cluster.
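The cross tabulation can also be condensed into a single overall accuracy figure, i.e. the proportion of test observations on the diagonal of the table (a sketch assuming the `correct_classes` and `lda.pred` objects from the chunk above; the exact value varies with the random train/test split):

```r
# overall prediction accuracy: share of correctly classified test observations
confusion <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(confusion)) / sum(confusion)
```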
Let’s reload the Boston data, standardize it again, inspect the distances, and run the k-means algorithm on this data.
# reload the Boston data
data("Boston")
# standardize the dataset
boston_scaled <- scale(Boston)
# calculate distances between observations
dist_eu <- dist(boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# run the k-means algorithm
km <- kmeans(boston_scaled, centers = 4)
km
## K-means clustering with 4 clusters of sizes 144, 188, 34, 140
##
## Cluster means:
## crim zn indus chas nox rm age
## 1 -0.4027600 1.0605053 -0.9558406 -0.2449881 -0.8845708 0.8024032 -1.0064759
## 2 -0.3722714 -0.4023982 -0.1279278 -0.2723291 -0.1953028 -0.2837133 0.1233787
## 3 -0.1985497 -0.2602436 0.2799956 3.6647712 0.3830784 0.2756681 0.3721322
## 4 0.9623940 -0.4872402 1.0869401 -0.2723291 1.0790746 -0.5112906 0.7791774
## dis rad tax ptratio black lstat
## 1 0.96581780 -0.589478299 -0.6726655 -0.6925272 0.3580834 -0.899561938
## 2 -0.04204253 -0.594569021 -0.5239892 0.1193084 0.2658321 -0.005441924
## 3 -0.40333817 0.001081444 -0.0975633 -0.3924585 0.1715427 -0.164352523
## 4 -0.83900192 1.404479156 1.4192210 0.6474109 -0.7669493 0.972485618
## medv
## 1 0.8660455
## 2 -0.1834340
## 3 0.5733409
## 4 -0.7837039
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 2 2 1 1 1 1 1 1 2 1 1 2 1 1 2 2 2 2 2 2
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 1 1 1 1 2 2 2 2 2 1 2 2 2 1 2 2 2 1 1 1
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 2 2 2 2 2
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 2 4 3 4 4 4 4 4 4 4 2 2 3 4 3 3 4 2 2 2
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 3 1 3 3 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 1
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
## 1 2 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
## 1 1 1 1 1 2 2 2 3 3 3 3 3 2 2 2 3 2 3 3
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 3 3 3 2 1 1 1 2 1 1 2 1 1 1 3 2 3 1 1 1
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## 1 1 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 2
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
## 1 1 1 1 1 2 2 1 1 3 2 1 2 3 3 1 3 3 1 1
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## 1 1 3 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## 2 2 2 2 2 1 2 2 2 1 1 1 1 2 2 2 2 2 2 2
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 3 3 3 4
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380
## 4 4 4 3 3 4 4 4 4 3 3 4 3 4 4 4 4 4 4 4
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500
## 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 2 2 2 2 2
## 501 502 503 504 505 506
## 2 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 1016.7319 844.8909 340.7321 1251.0784
## (between_SS / total_SS = 51.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The k-means algorithm needs the number of clusters to be specified. I chose 4 clusters at random, but there is a better way to determine the appropriate number of clusters. We can do this by inspecting the total within cluster sum of squares (TWCSS):
set.seed(42)
# determine the maximum number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The plot shows the TWCSS on the y-axis and the number of clusters on the x-axis. Here I chose to plot the TWCSS for 1-10 clusters (as it is very unlikely that the solution would improve with a larger number of clusters). The rule of thumb for choosing the appropriate number of clusters is to find a “knot” (or elbow) in the line, i.e. the point after which the TWCSS no longer drops radically. In this case, it seems that 2 is a good number of clusters.
Let’s run the k-means algorithm with two clusters.
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
km
## K-means clustering with 2 clusters of sizes 329, 177
##
## Cluster means:
## crim zn indus chas nox rm
## 1 -0.3894158 0.2621323 -0.6146857 0.002908943 -0.5823397 0.2446705
## 2 0.7238295 -0.4872402 1.1425514 -0.005407018 1.0824279 -0.4547830
## age dis rad tax ptratio black lstat
## 1 -0.4331555 0.4540421 -0.5828749 -0.6291043 -0.2943707 0.3282754 -0.4530491
## 2 0.8051309 -0.8439539 1.0834228 1.1693521 0.5471636 -0.6101842 0.8421083
## medv
## 1 0.3532917
## 2 -0.6566834
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 1 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500
## 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1
## 501 502 503 504 505 506
## 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 2686.045 1890.637
## (between_SS / total_SS = 35.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# plot the Boston dataset with clusters
boston_scaled <- as.data.frame(boston_scaled)
plot2 <- ggpairs(boston_scaled,
                 mapping = aes(col = as.factor(km$cluster), alpha = 0.1),
                 lower = list(continuous = wrap("points", size = 0.1)),
                 upper = "blank",
                 switch = "both") +
  theme_bw() +
  theme(strip.background = element_rect(fill = "white"),
        axis.text.x = element_text(size = 4),
        axis.text.y = element_text(size = 5))
plot2
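Before interpreting the plot, it can help to print a compact view of the two cluster centres on a few key variables (a sketch assuming the two-cluster `km` object fitted above; the values are in standard deviation units since the data were standardized):

```r
# cluster centres for selected variables, rounded for readability
round(km$centers[, c("crim", "zn", "indus", "nox", "rad", "lstat", "medv")], 2)
```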
We can see that cluster 2 is described by somewhat higher crime rates and a distinctly lower proportion of large residential plots, a higher nitrogen oxides concentration, and more older buildings. It also contains areas with shorter distances to employment centres, higher accessibility to highways and a higher property tax rate, as well as a higher pupil-teacher ratio and a higher percentage of lower status population. Cluster 1, in turn, seems to have a low crime rate, a high proportion of large residential plots, a smaller amount of non-retail business land and a lower nitrogen oxides concentration, as well as lower accessibility to highways and a lower percentage of lower status population. In short, our two clusters seem to describe two types of areas: one with generally more desirable attributes for a residential area and quite likely residents of higher socioeconomic status, and the other with less desirable attributes and residents of lower socioeconomic status. If I had to guess, I'd say these areas probably differ in their proximity to the city proper, with cluster 2 describing more urban areas and cluster 1 describing more suburban areas.

***
This week we learned about dimensionality reduction techniques, such as principal component analysis and multiple correspondence analysis.
# load the necessary packages and set scientific notation off
options(scipen = 999)
library("dplyr")
library("GGally")
library("tidyr")
In this week's analyses we use a data set that combines the Human Development Index and Gender Inequality Index data sets. More information can be found here: http://hdr.undp.org/en/content/human-development-index-hdi
# load the data
human <- read.csv("~/Desktop/kurssit-syksy-2020/IODS-project/data/human.csv", row.names = 1)
# summaries
summary(human)
## edu.ratio lab.ratio life.exp edu.exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## gni mmr birthrate parliament
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The data consists of 155 countries and 8 variables that reflect human development and gender inequality:
- edu.ratio is the ratio of the percentage of women to the percentage of men with some secondary education
- lab.ratio is the ratio of the percentage of women to the percentage of men over the age of 15 in the labour force
- life.exp is life expectancy at birth in years
- edu.exp is expected years of schooling
- gni is the gross national income (GNI) per capita
- mmr is the maternal mortality ratio
- birthrate is the adolescent birth rate
- parliament is the percentage of seats in parliament held by women

Let's have a closer look at the distributions of and relationships between the variables.
# graphical overview
plot1 <- ggpairs(human,
                 mapping = aes(alpha = 0.1),
                 lower = list(continuous = wrap("points", size = 0.5)),
                 upper = list(continuous = wrap("cor", size = 3))) +
  theme_bw() +
  theme(strip.background = element_rect(fill = "white"),
        axis.text.x = element_text(size = 4),
        axis.text.y = element_text(size = 5))
plot1
We can see that in most countries the ratio of women to men with some secondary education is close to 1, which means that almost equal proportions of women and men have secondary education. The labour force ratio, on the other hand, has its peak a bit below 1, so a larger proportion of men participates in the labour force compared to women. Life expectancy is most commonly around 70-80 years, with an understandable left skew as the human lifespan is limited. Expected years of schooling seems to be quite normally distributed, with the most common values around 10-15 years. The GNI has a heavy right skew, with only a couple of countries reaching a GNI over 50 000 per capita. Maternal mortality ratio and adolescent birth rate are also skewed to the right, with low values being the most prevalent. Finally, the percentage of parliamentary seats held by women is well below 50 % in most countries.
The secondary education gender ratio seems to correlate with all other variables except the labour force participation gender ratio and the proportion of women in parliament. Countries with a higher proportion of women with secondary education tend to have higher life expectancy, expected education, and GNI, as well as lower maternal mortality and adolescent birth rate. A higher proportion of women participating in the labour force is associated with higher maternal mortality (which seems counterintuitive and might be some sort of statistical fluke) and a higher proportion of women in parliament. Higher life expectancy is associated with higher expected education and higher GNI, and with lower maternal mortality and adolescent birth rate. Countries with higher GNI tend to have lower maternal mortality and adolescent birth rate, and high maternal mortality is associated with a high adolescent birth rate.
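The correlations described above can also be printed as a plain rounded matrix, which is sometimes easier to scan than the ggpairs panel (a small sketch assuming the `human` data frame loaded above):

```r
# pairwise Pearson correlations between all eight variables
round(cor(human), 2)
```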
Let’s run a principal component analysis on the human development data.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
summary_human <- summary(pca_human)
summary_human
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 18544.1639 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 0.9999 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 0.9999 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
# extract the variances captured for the plot
pca_pr1 <-round(100 * summary_human$importance[2, ], digits = 1)
pcalab1 <- paste0(names(pca_pr1), "(",pca_pr1, "%)")
# plot
biplot(pca_human, choices = 1:2, cex = c(0.5, 0.6), col = c("grey80", "black"), xlab = pcalab1[1], ylab = pcalab1[2])
Figure 1. The first two components of PCA on unstandardized data. The 1st component explains 99.99 % of the variability and is almost entirely determined by the GNI: countries with a higher GNI are placed on the left side of the plot.
Well, this seems messy. The first component captures 99.99 % of the variance, and the GNI seems to be the only variable with really any effect. This is because PCA is sensitive to the scaling of the variables: features with large variance are assumed to be more important than those with small variance. And the GNI, as it is on a completely different scale than the rest of the variables with values ranging from hundreds to hundreds of thousands, most definitely has the largest variance in our data. It’s probably better to standardize the data so that the scales are comparable.
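The column variances make this concrete (a quick sketch assuming the `human` data frame from above): the variance of gni is orders of magnitude larger than that of any other variable, so it dominates the unscaled PCA.

```r
# compare the variances of the unstandardized variables
round(sapply(human, var), 1)
```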
# Standardize the dataset
human_std <- scale(human)
# perform pca to the standardized data
pca_humanstd <- prcomp(human_std)
summary_humanstd <- summary(pca_humanstd)
summary_humanstd
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
# extract the variances captured for the plot
pca_pr2 <-round(100 * summary_humanstd$importance[2, ], digits = 1)
pcalab2 <- paste0(names(pca_pr2), "(",pca_pr2, "%)")
# plot
biplot(pca_humanstd, choices = 1:2, cex = c(0.5, 0.6), col = c("grey80", "black"), xlab = pcalab2[1], ylab = pcalab2[2])
Figure 2. The first two components of PCA on standardized data. The 1st component explains 53.6 % of the variability and is determined by the GNI, expected years of schooling, life expectancy, the ratio of the percentages of women to men with some secondary education, maternal mortality, and adolescent birthrate. The 2nd component explains 16.2 % of the variability and is determined by the percentage of parliamentary seats held by women and the ratio of the percentages of women to men in the labour force.
This looks better. Now that the variables are on comparable scales, the first component explains 53.61 % of the variability, the second 16.24 %, and each of the remaining six components less than 10 %.
The first principal component is defined by two sets of variables: expected education, life expectancy, the secondary education ratio, and the GNI on one hand, and maternal mortality and adolescent birthrate on the other. Countries that place higher on the first principal component have lower life expectancy, expected education, and GNI, and fewer women with secondary education compared to men, as well as high maternal mortality and adolescent birthrate. The second principal component reflects the labour force participation gender ratio and the proportion of women in parliament. The lower a country is placed on the second principal component, the lower the proportion of women it has in parliament and in the labour force. So roughly speaking, in the upper left corner we have countries that do well on the human development and gender equality indicators, and in the bottom right we have countries that don't do so well in terms of human development or gender equality.
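This interpretation can be checked against the variable loadings of the standardized PCA (a sketch assuming the `pca_humanstd` object from above; the signs of the loadings are arbitrary, only their relative pattern matters):

```r
# loadings of the variables on the first two principal components
round(pca_humanstd$rotation[, 1:2], 2)
```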
The two components could perhaps be labeled well-being and development (1st component) and equal gender representation (2nd component). Although the gender ratio of secondary education could be thought of as an indicator of gender equality, it contributes clearly to the 1st component together with more “general” indicators of development and well-being. This is somewhat surprising, but perhaps the gender equality in secondary education tells us more about the availability of education in general, and gender representation in the parliament and labour force are a better measure of gender equality. If we look at the countries that place to the left on the first component but not very high on the second component, we can see that there are such countries as Japan and Korea: developed countries where traditional gender roles still largely hold.
Multiple correspondence analysis is a dimensionality reduction technique that can be used when data consists of categorical variables. We can try this out with the tea data set from the FactoMineR package.
# load the package FactoMineR
require(FactoMineR)
## Loading required package: FactoMineR
# load the dataset ´tea´
data(tea)
# structure and dimensions of the data
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
The data contains 300 observations and 36 variables that reflect tea-related habits and some background information. Unfortunately, there is very little information available about the exact nature of these variables, or at least I couldn't find any. Anyway, we can examine the categorical variables (all but age) more closely with barplots.
# visualize the data
gather(tea[,1:12]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle=45, hjust = 1, size = 8))
gather(tea[,c(13:18, 20:25)]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle=45, hjust = 1, size = 8))
gather(tea[,26:36]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle=45, hjust = 1, size = 8))
As the data has so many variables, we can choose just a few to run the MCA with. I decided to go with the variables which I could actually interpret in the absence of proper data documentation, and which seemed to make at least some sense to examine together. I chose the variable Tea, reflecting whether the person prefers black, Earl Grey, or green tea; the very confusingly named pair How and how, reflecting whether the person likes their tea plain or with lemon, milk, or something else, and whether they use tea bags, loose leaf, or both (respectively); sugar, which tells us whether or not the person drinks their tea with sugar; and where, reflecting whether the person buys their tea from a chain store, a tea shop, or both.
# As the data contains a lot of variables, let's choose only some for easier interpretations
new.tea <- subset(tea, select = c("Tea", "How", "sugar", "how", "where"))
# MCA
mca <- new.tea %>% MCA(graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = ., graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.335 0.309 0.257 0.224 0.198 0.184 0.172
## % of var. 16.762 15.441 12.866 11.181 9.924 9.188 8.625
## Cumulative % of var. 16.762 32.204 45.069 56.250 66.174 75.362 83.987
## Dim.8 Dim.9 Dim.10
## Variance 0.141 0.105 0.075
## % of var. 7.031 5.249 3.734
## Cumulative % of var. 91.017 96.266 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
## 1 | -0.325 0.105 0.088 | -0.288 0.090 0.069 | 0.305 0.120 0.078
## 2 | -0.259 0.067 0.037 | -0.074 0.006 0.003 | 0.789 0.806 0.343
## 3 | -0.403 0.162 0.242 | -0.283 0.086 0.119 | 0.217 0.061 0.070
## 4 | -0.580 0.334 0.481 | -0.328 0.116 0.154 | -0.290 0.109 0.120
## 5 | -0.403 0.162 0.242 | -0.283 0.086 0.119 | 0.217 0.061 0.070
## 6 | -0.403 0.162 0.242 | -0.283 0.086 0.119 | 0.217 0.061 0.070
## 7 | -0.403 0.162 0.242 | -0.283 0.086 0.119 | 0.217 0.061 0.070
## 8 | -0.259 0.067 0.037 | -0.074 0.006 0.003 | 0.789 0.806 0.343
## 9 | 0.155 0.024 0.012 | 0.974 1.025 0.461 | -0.005 0.000 0.000
## 10 | 0.521 0.270 0.142 | 0.845 0.770 0.373 | 0.612 0.485 0.196
##
## 1 |
## 2 |
## 3 |
## 4 |
## 5 |
## 6 |
## 7 |
## 8 |
## 9 |
## 10 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## black | 0.473 3.294 0.073 4.681 | 0.198 0.627 0.013 1.961 |
## Earl Grey | -0.265 2.689 0.126 -6.147 | 0.088 0.319 0.014 2.033 |
## green | 0.487 1.558 0.029 2.962 | -0.956 6.515 0.113 -5.813 |
## alone | -0.017 0.011 0.001 -0.405 | -0.251 2.643 0.117 -5.905 |
## lemon | 0.668 2.926 0.055 4.059 | 0.458 1.497 0.026 2.787 |
## milk | -0.338 1.428 0.030 -3.010 | 0.220 0.659 0.013 1.963 |
## other | 0.287 0.148 0.003 0.873 | 2.207 9.465 0.151 6.712 |
## No.sugar | 0.247 1.879 0.065 4.414 | 0.061 0.123 0.004 1.084 |
## sugar | -0.264 2.009 0.065 -4.414 | -0.065 0.132 0.004 -1.084 |
## tea bag | -0.607 12.466 0.482 -12.008 | -0.336 4.134 0.147 -6.637 |
## Dim.3 ctr cos2 v.test
## black 1.045 20.945 0.358 10.342 |
## Earl Grey -0.462 10.689 0.386 -10.737 |
## green 0.360 1.110 0.016 2.190 |
## alone 0.150 1.136 0.042 3.534 |
## lemon -1.561 20.842 0.301 -9.491 |
## milk 0.093 0.141 0.002 0.828 |
## other 1.826 7.773 0.103 5.552 |
## No.sugar 0.621 15.481 0.412 11.100 |
## sugar -0.664 16.548 0.412 -11.100 |
## tea bag 0.069 0.211 0.006 1.367 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.115 0.421 |
## How | 0.076 0.220 0.385 |
## sugar | 0.065 0.004 0.412 |
## how | 0.708 0.503 0.008 |
## where | 0.701 0.702 0.061 |
# plot the individuals and the variable categories
plot(mca, invisible=c("var"))
plot(mca, invisible=c("ind"), habillage = "quali")
From the first plot we can see how the individuals place on the first two dimensions. There might be some clustering visible: perhaps one group located near the middle on both dimensions, a second that is higher on the second dimension, and a third that is high on the first dimension but low on the second.
From the second plot we can see how the variables contribute to the dimensions and relate to each other. The first two dimensions account for 16.76 % and 15.44 % of the variance. We can see that buying tea from both a chain store and a tea shop and buying both tea bags and loose leaf are associated, as are buying tea from a tea shop and using loose leaf only. They also separate individuals on the second dimension, as the two groups are far away from each other on that dimension. Those who use tea bags from a chain store are far away from the loose-leaf-from-a-tea-shop persons on the first dimension. The rest of the variables cluster together pretty closely, except for green tea and the “other” additions to the tea (I still have little clue what this category could contain), which both seem to be linked to the second dimension. If I'm interpreting this correctly, I think we might see some separation between the fancy tea drinkers who like loose leaf from tea shops, the “anything goes” drinkers who mix tea bags and loose leaf from chain stores and tea shops and enjoy it with some mystery condiments, and the regular everyday tea drinkers who prefer tea bags from chain stores and the more common condiments.
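The category contributions behind this reading can also be extracted from the MCA object directly (a sketch assuming the `mca` object fitted above; the values are the same contributions shown in the summary, as percentages per dimension):

```r
# contribution (%) of each category to the first two MCA dimensions
round(mca$var$contrib[, 1:2], 1)
```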
(more chapters to be added similarly as we proceed with the course!)